
fix tpd panic loop: cxo Finc-to-negative + httputil short-write discriminator #2422

Merged
0pcom merged 1 commit into skycoin:develop from 0pcom:fix/tpd-panics-cxo-finc-and-httputil-shortwrite
May 4, 2026

Conversation


0pcom (Collaborator) commented May 4, 2026

Summary

Production TPD has been restarting every 30-40 seconds (RestartCount: 556 over 6 hours on the prod host running develop tip 96ee3806c). Two distinct panics, found by grepping the container logs, are tearing down the process:

1. pkg/cxo/skyobject/cache.go (*Cache).Finc to negative

Stack:

panic: Finc to negative for: <hash>
github.com/skycoin/skywire/pkg/cxo/skyobject.(*Cache).Finc cache.go:1189
github.com/skycoin/skywire/pkg/cxo/skyobject.(*Filler).apply fill.go:225
github.com/skycoin/skywire/pkg/cxo/skyobject.(*Filler).Run.func1 fill.go:297

The filling-refcount goes below zero — likely a duplicate Finc on a Filler.incs map, or an Inc/Finc mismatch across overlapping fillers. The historical behavior was to panic, which became visible after #2420 increased the rate of fill churn (or the bug was always latent and load surfaced it).

Filler.apply and Filler.reject already consume Finc's error return and just log:

if err := f.c.Finc(key, inc); err != nil {
    log.Printf("[CXO] filler apply: failed to increment ref count for %s: %v", key.Hex(), err)
}

So surfacing the inconsistency through that path costs nothing extra. The fix clamps fc to 0, logs the offending key + inc, and continues. Worst case is a leaked filling-item slot — orders of magnitude better than a process kill every 30 seconds.
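Concretely, the change inside (*Cache).Finc replaces the panic with a clamp along these lines (a sketch, not the literal diff; `item` and its `fc` field are assumed names for the filling-item record around cache.go:1189):

item.fc += inc
if item.fc < 0 {
    // Clamp instead of panicking. Filler.apply / Filler.reject already
    // log the returned error, so the inconsistency stays visible
    // without tearing down the process.
    item.fc = 0
    return fmt.Errorf("Finc to negative for: %s (inc %d), clamped to 0",
        key.Hex(), inc)
}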

2. pkg/httputil/httputil.go WriteJSON panic on slow-client timeout

Stack:

panic: short write: i/o deadline reached
github.com/skycoin/skywire/pkg/httputil.WriteJSON httputil.go:50
github.com/skycoin/skywire/pkg/transport-discovery/api.(*API).getAllTransports endpoints.go:319

getAllTransports writes ~1MB of JSON. On a slow client, net/http's per-write deadline expires mid-response, and the response writer returns an error whose message is "short write: i/o deadline reached". The message matches io.ErrShortWrite's text, but the value is net/http's timeoutWriter's own error, so errors.Is(err, io.ErrShortWrite) returns false. The fallback string match in isIOError only checked "broken pipe" and "connection reset", so the discriminator returned false and WriteJSON ran the panic branch.

Added "short write", "i/o timeout", and "deadline exceeded" to the string-match fallback. TestIsIOErrorShortWriteVariants pins both sentinel and string-match cases.

Test plan

  • go build ./... clean.
  • go vet ./pkg/cxo/... ./pkg/httputil/... clean.
  • go test ./pkg/httputil/ ./pkg/cxo/skyobject/ ./pkg/transport-discovery/... all pass.
  • gofmt clean.
  • New unit test TestIsIOErrorShortWriteVariants covers nil, all four current sentinels, the production short-write case, the i/o-timeout / deadline-exceeded text variants, and the existing broken-pipe / connection-reset cases (a sketch follows this list).
  • Post-deploy: TPD container RestartCount no longer climbing; docker logs transport-discovery | grep -c panic: reaches 0 within a few minutes.
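A table-driven sketch of what that test plausibly pins, with the case list inferred from the bullet above (the real test may differ):

package httputil

import (
    "errors"
    "io"
    "testing"
)

func TestIsIOErrorShortWriteVariants(t *testing.T) {
    cases := []struct {
        name string
        err  error
        want bool
    }{
        {"nil", nil, false},
        {"sentinel short write", io.ErrShortWrite, true},
        {"production short write", errors.New("short write: i/o deadline reached"), true},
        {"i/o timeout text", errors.New("read tcp 10.0.0.1:443: i/o timeout"), true},
        {"deadline exceeded text", errors.New("context deadline exceeded"), true},
        {"broken pipe", errors.New("write tcp: broken pipe"), true},
        {"connection reset", errors.New("read tcp: connection reset by peer"), true},
        {"unrelated error", errors.New("something else"), false},
    }
    for _, tc := range cases {
        if got := isIOError(tc.err); got != tc.want {
            t.Errorf("%s: isIOError(%v) = %v, want %v", tc.name, tc.err, got, tc.want)
        }
    }
}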

Notes

Neither bug is caused by the recent latency PRs (#2415, #2418, #2421); the panics' stacks make that explicit. They've just been hammering the deployment more visibly since latency persistence and outlier filtering landed back-to-back, because every TPD bounce now drops registration TTLs on visors that haven't re-pushed yet. Fixing these two restores normal TPD uptime.

fix tpd panic loop: cxo Finc-to-negative + httputil short-write discriminator

Production TPD was restarting every 30-40s (RestartCount 556 over 6h
on the prod host) because two distinct panics tear down the process:

1. pkg/cxo/skyobject/cache.go (*Cache).Finc:1189,1216
   panic: "Finc to negative for: <hash>"
   The filling-refcount went below zero — likely a duplicate Finc on
   a Filler.incs map, or an Inc/Finc mismatch across overlapping
   fillers. Hard process kill via panic. Filler.apply / Filler.reject
   already consume Finc's error return and just log; surface the
   inconsistency through that path instead. Clamp fc to 0, log the
   condition with key and the offending inc, and continue. Worst
   case is a leaked filling-item slot — orders of magnitude better
   than killing the service.

2. pkg/httputil/httputil.go WriteJSON:50
   panic: "short write: i/o deadline reached"
   isIOError checks errors.Is(err, io.ErrShortWrite), but
   net/http's timeoutWriter returns its own error value with the
   same message string when a write deadline expires mid-response.
   errors.Is misses it (different sentinel), the fallback string
   match didn't include "short write", so getAllTransports'
   ~1MB JSON write to a slow client panics on every deadline hit.
   Added "short write", "i/o timeout", and "deadline exceeded"
   to the string-match fallback. New TestIsIOErrorShortWriteVariants
   pins all sentinel and string-match cases.

Neither bug is caused by skycoin#2415/skycoin#2418/skycoin#2421; they were just made
visible because deploys cycling at 30-40s aren't subtle. Together
these stop the panic loop without changing any data semantics.
0pcom merged commit 80fef71 into skycoin:develop May 4, 2026
3 of 4 checks passed
0pcom added a commit that referenced this pull request May 4, 2026
… (#2423)

#2422 silenced the two dominant TPD panics ('Finc to negative',
'short write: i/o deadline reached'). RestartCount went from
~556/6h to 2/10min. The remaining two panics are nil-derefs in
pkg/cxo/node/head.go from torn-down fillHead state:

1. createFiller line 338: cr.c.String() panics when cr.c is nil.
   handleDelConn (line 517) sets f.p.c = nil when the source
   connection is removed. f.p.r remains non-zero, so the
   f.p == (connRoot{}) gate at handleFillingResult does NOT catch
   the nil-c case, and createFiller(f.p) runs with a nil source.
   Fix: nil-c short-circuit at the top of createFiller. The fill
   can't proceed without a source connection — log and return.

2. handleSuccess line 228 (and the parallel sites in handleRequest,
   handleRequestFailure, handleReceivedRoot): f.fc.PushBack on a
   nil *list.List. closeFiller (line 384) sets
   `f.rqo, f.fc, f.rq = nil, nil, nil`, so any in-flight Filler
   goroutine messages that drain after close land on a nil list.
   Fix: nil-guard each PushBack/PushFront site (4 sites total).
   Same recovery as #2422's panic→drop pattern: the closed filler's
   work is no longer relevant.

These are the actual races; the right "real" fix would be to drain
or signal-shutdown the channels before nilling state, but that's a
larger structural change. Guards stop the process kill while
preserving observability.
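The second guard in that commit is the same drop-instead-of-panic pattern as this PR. A self-contained illustration of the nil-list guard (fillerState and push are illustrative stand-ins, not the real pkg/cxo/node types):

package main

import (
    "container/list"
    "log"
)

// fillerState loosely mirrors the fields named in the commit: fc is the
// filler's result queue, which closeFiller nils out.
type fillerState struct {
    fc *list.List
}

// push is the guarded PushBack pattern from fix 2: a Filler message that
// drains after closeFiller has nilled the list is logged and dropped
// instead of panicking on a nil *list.List.
func (f *fillerState) push(v any) {
    if f.fc == nil {
        log.Printf("[CXO] filler already closed, dropping %v", v)
        return
    }
    f.fc.PushBack(v)
}

func main() {
    f := &fillerState{fc: list.New()}
    f.push("in-flight result") // queued normally
    f.fc = nil                 // closeFiller tears down the queue
    f.push("stale result")     // guarded: logged and dropped, no panic
}

The nil-c short-circuit in createFiller (fix 1) is the analogous early return: check the source connection before using it, log, and bail out.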